network size


Limits to Depth-Efficiencies of Self-Attention

Neural Information Processing Systems

Self-attention architectures, which are rapidly pushing the frontier in natural language processing, demonstrate a surprising depth-inefficient behavior: previous works indicate that increasing the internal representation (network width) is just as useful as increasing the number of self-attention layers (network depth).
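For intuition on the depth-versus-width trade-off the abstract describes, here is a minimal Python sketch (not from the paper; the block layout and the `ffn_mult=4` feed-forward factor are assumptions) that counts parameters when a toy transformer is scaled in depth versus in width:

```python
# Minimal sketch: parameter growth when scaling a toy transformer's
# depth versus its width (biases and embeddings ignored).
def transformer_params(depth: int, width: int, ffn_mult: int = 4) -> int:
    # Per block: 4 * width^2 for the Q/K/V/output projections,
    # plus 2 * ffn_mult * width^2 for the feed-forward sublayer.
    per_block = 4 * width**2 + 2 * ffn_mult * width**2
    return depth * per_block

base   = transformer_params(depth=6,  width=512)
deeper = transformer_params(depth=12, width=512)   # 2x depth -> 2x parameters
wider  = transformer_params(depth=6,  width=1024)  # 2x width -> 4x parameters
print(base, deeper, wider)
```

Under this rough count, widening grows the parameter budget faster than deepening, which is one way to make the depth-versus-width comparison concrete.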








Neural Information Processing Systems

In this work, we focus on a classification problem and investigate the behavior of both non-calibrated and calibrated negative log-likelihood (CNLL) of a deep ensemble as a function of the ensemble size and the member network size.
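As an illustration of the quantities being measured, the following Python sketch computes a non-calibrated and a calibrated ensemble NLL on toy data (assumptions not specified by this excerpt: softmax member classifiers, and temperature scaling as the calibration method):

```python
# Minimal sketch: non-calibrated vs. temperature-calibrated NLL
# of a deep ensemble (toy random logits stand in for member outputs).
import numpy as np

def nll(probs: np.ndarray, labels: np.ndarray) -> float:
    """Average negative log-likelihood of the predicted class probabilities."""
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12)))

def ensemble_probs(member_logits, temperature: float = 1.0) -> np.ndarray:
    """Average the members' softmax outputs after scaling logits by 1/T."""
    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)
    return np.mean([softmax(l / temperature) for l in member_logits], axis=0)

rng = np.random.default_rng(0)
logits = [rng.normal(size=(5, 4)) for _ in range(3)]  # 3 members, 5 examples, 4 classes
labels = rng.integers(0, 4, size=5)

nll_plain = nll(ensemble_probs(logits), labels)  # non-calibrated NLL
# Calibrated NLL: pick the temperature that minimizes NLL on held-out data.
temps = np.linspace(0.1, 5.0, 50)
nll_cal = min(nll(ensemble_probs(logits, t), labels) for t in temps)
print(nll_plain, nll_cal)
```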



On the Power and Limitations of Random Features for Understanding Neural Networks

Neural Information Processing Systems

Recently, a spate of papers have provided positive theoretical results for training over-parameterized neural networks (where the network size is larger than what is needed to achieve low error). The key insight is that with sufficient over-parameterization, gradient-based methods will implicitly leave some components of the network relatively unchanged, so the optimization dynamics will behave as if those components are essentially fixed at their initial random values. In fact, fixing these explicitly leads to the well-known approach of learning with random features (e.g.
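The random-features approach mentioned here has a very short concrete form: freeze the randomly initialized first layer and train only a linear readout on top of it. A minimal NumPy sketch (the toy data and the ridge-regression readout are illustrative choices, not from the paper):

```python
# Minimal sketch of learning with random features: the hidden weights W
# stay fixed at their random initialization; only the readout is trained.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 10, 500  # samples, input dim, number of random features

X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)  # toy regression target

W = rng.normal(size=(d, m)) / np.sqrt(d)  # random, *fixed* first-layer weights
features = np.maximum(X @ W, 0.0)         # ReLU random features

# Train only the linear readout: ridge regression in closed form.
lam = 1e-3
a = np.linalg.solve(features.T @ features + lam * np.eye(m), features.T @ y)

mse = np.mean((features @ a - y) ** 2)
print(f"train MSE with fixed random features: {mse:.4f}")
```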


Symplectic Adjoint Method for Exact Gradient of Neural ODE with Minimal Memory

Neural Information Processing Systems

A neural ODE, i.e., a neural network model of a differential equation, has enabled the learning of continuous-time dynamical systems and probability distributions with high accuracy. The neural ODE uses the same network repeatedly during numerical integration. The memory consumption of the backpropagation algorithm is therefore proportional to the number of uses times the network size, and this holds even when a checkpointing scheme divides the computation graph into sub-graphs.
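To see why memory scales with the number of network uses, consider backpropagating through an explicit Euler solver: autograd retains the intermediate activations of every step. A minimal PyTorch sketch (illustrative only; this is the naive cost the paper's minimal-memory method aims to avoid):

```python
# Minimal sketch: naive backprop through an ODE solver stores the
# network's activations at every step, so peak memory grows with
# (number of steps) x (network size).
import torch

f = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 2))

def euler_integrate(z0: torch.Tensor, steps: int, dt: float = 0.01) -> torch.Tensor:
    z = z0
    for _ in range(steps):
        # Each call to f adds its intermediate activations to the autograd
        # graph; they are all kept alive until backward() runs.
        z = z + dt * f(z)
    return z

z0 = torch.randn(32, 2, requires_grad=True)
loss = euler_integrate(z0, steps=100).pow(2).sum()
loss.backward()  # exact gradients, but memory scaled with the 100 stored steps
print(z0.grad.shape)
```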